
Vectorized hash grouping by a single text column #7586

Open
akuzm wants to merge 31 commits into main

Conversation

@akuzm (Member) commented Jan 10, 2025

Use UMASH hashes, which have a guaranteed upper bound on the collision probability, as
the hash table keys.
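
For context, a minimal sketch of the idea; the struct and function names below are hypothetical placeholders, not this PR's code:

/*
 * Illustrative sketch only (hypothetical names). The grouping key stored in
 * the hash table is the 128-bit UMASH fingerprint of the text value; the
 * proven collision bound lets us compare fingerprints instead of the full
 * strings, while the original bytes are kept only to emit the output tuple.
 * uint64/uint32/uint8 are the PostgreSQL typedefs from c.h.
 */
typedef struct TextGroupingKey
{
	uint64 fingerprint[2];  /* 128-bit UMASH fingerprint, used as the hash table key */
	uint32 text_len;        /* original value, kept for building the output tuple */
	const uint8 *text_data;
} TextGroupingKey;

static inline bool
text_grouping_keys_equal(const TextGroupingKey *a, const TextGroupingKey *b)
{
	/* Key equality is decided on the fingerprint alone. */
	return a->fingerprint[0] == b->fingerprint[0] && a->fingerprint[1] == b->fingerprint[1];
}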

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 90.00000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 81.83%. Comparing base (59f50f2) to head (679c56d).
Report is 730 commits behind head on main.

Files with missing lines                                   Patch %   Lines
...des/vector_agg/hashing/hash_strategy_single_text.c      88.88%    1 Missing and 4 partials ⚠️
tsl/src/nodes/vector_agg/plan.c                            50.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7586      +/-   ##
==========================================
+ Coverage   80.06%   81.83%   +1.76%     
==========================================
  Files         190      243      +53     
  Lines       37181    44828    +7647     
  Branches     9450    11184    +1734     
==========================================
+ Hits        29770    36686    +6916     
- Misses       2997     3732     +735     
+ Partials     4414     4410       -4     


@akuzm akuzm marked this pull request as ready for review January 29, 2025 12:01
@erimatnor (Contributor) left a comment

LGTM. I added some suggestions for minor improvements. Also have some general questions.

Comment on lines +35 to +39
typedef struct BytesView
{
const uint8 *data;
uint32 len;
} BytesView;
Contributor:

Instead of defining a new type, can we use StringInfo and initReadOnlyStringInfo? Just an idea.
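
For reference, a minimal sketch of the suggestion, assuming PostgreSQL 16 or later where initReadOnlyStringInfo() is available in lib/stringinfo.h; the wrapper function below is hypothetical:

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Hypothetical example: wrap an existing buffer in a read-only StringInfo
 * instead of a custom BytesView. No allocation or copy happens; the buffer
 * must outlive the StringInfoData.
 */
static void
example_wrap_key(const uint8 *key_data, uint32 key_len)
{
	StringInfoData view;

	initReadOnlyStringInfo(&view, (char *) key_data, key_len);

	/* view.data and view.len now play the role of BytesView.data and BytesView.len. */
	elog(DEBUG1, "key of %d bytes at %p", view.len, (void *) view.data);
}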

Member (author):

Well, it's not a complex type, and StringInfo has some unrelated things, so I'd keep it as is.

BytesView *restrict output_key = (BytesView *) output_key_ptr;
HASH_TABLE_KEY_TYPE *restrict hash_table_key = (HASH_TABLE_KEY_TYPE *) hash_table_key_ptr;

if (unlikely(params.single_grouping_column.decompression_type == DT_Scalar))
Contributor:

Question for my own understanding (not asking for any changes for this PR):

This is deep into vector aggregation, so I would expect that this function only gets passed an arrow array. But now we need to check for different non-array formats, including impossible cases (e.g., DT_Iterator). If we only passed in arrays, these checks would not be necessary.

The arrow array format already supports everything we need. Even scalar/segmentby values can be represented by arrow arrays (e.g., run-end encoded).

Now we need this extra code to check for different formats/cases everywhere we reach into the data. Some of them shouldn't even be possible here.

IMO, the API to retrieve a value should be something like:

Datum d = arrow_array_get_value_at(array, rownum, &isnull, &valuelen);

This function can easily check the encoding of the array (dict, runend, etc.) to retrieve the value requested.
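
A rough sketch of what such an accessor could look like; the ExampleArrowArray struct, encoding tags, and buffer layout below are hypothetical and only illustrate the dispatch idea, not the actual TimescaleDB structures:

#include "postgres.h"

/* Hypothetical encoding tags and array layout, for illustration only. */
typedef enum
{
	EXAMPLE_ENC_PLAIN,      /* offsets + data buffer */
	EXAMPLE_ENC_DICTIONARY, /* indexes into a dictionary of distinct values */
	EXAMPLE_ENC_RUN_END     /* run-end encoded, e.g. a scalar/segmentby value */
} ExampleEncoding;

typedef struct ExampleArrowArray
{
	ExampleEncoding encoding;
	const uint64 *validity; /* null bitmap, bit set = value present */
	const uint32 *offsets;  /* plain encoding: value i is data[offsets[i] .. offsets[i+1]) */
	const uint8 *data;
} ExampleArrowArray;

static Datum
arrow_array_get_value_at(const ExampleArrowArray *array, int rownum, bool *isnull, int *valuelen)
{
	*isnull = array->validity != NULL &&
			  !(array->validity[rownum / 64] & (UINT64CONST(1) << (rownum % 64)));
	if (*isnull)
	{
		*valuelen = 0;
		return (Datum) 0;
	}

	switch (array->encoding)
	{
		case EXAMPLE_ENC_PLAIN:
			/* The Datum points at the raw bytes; *valuelen carries the length. */
			*valuelen = array->offsets[rownum + 1] - array->offsets[rownum];
			return PointerGetDatum(&array->data[array->offsets[rownum]]);
		case EXAMPLE_ENC_DICTIONARY:
			/* map rownum -> dictionary index, then look the value up in the dictionary */
			break;
		case EXAMPLE_ENC_RUN_END:
			/* binary-search the run ends for the run containing rownum */
			break;
	}
	*valuelen = 0;
	return (Datum) 0;
}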

Member (author):

Yeah, I have already regretted supporting the scalar values throughout aggregation. Technically it should perform better, because it avoids creating, e.g., an arrow array with the same constant value for every row, and sometimes we perform the computations in a different way for scalar values. But the implementation complexity might be a little too high. Maybe I should look into removing this, at least for the key column, and always materializing the values into arrow arrays. I'm going to consider this after we merge the multi-column aggregation.

The external interface might still turn out to be more complex than what you suggest, and closer to the current CompressedColumnValues, because sometimes we have to statically generate a function that works specifically with, e.g., dictionary encoding, and that won't be possible if we determine the encoding inside an opaque callback. We can't call an opaque callback (i.e. a non-inlinable dynamic function) for every row because it produces significantly less performant code.
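
As an illustration of that last point (hypothetical names, not TimescaleDB code): the generic loop below pays an indirect call per row, while the per-encoding specialization lets the compiler inline the accessor and vectorize.

#include "postgres.h"

/* Generic approach: one indirect, non-inlinable call per row. */
typedef uint64 (*row_hash_fn)(const void *column_state, int row);

static void
hash_rows_generic(row_hash_fn get_hash, const void *column_state, int nrows, uint64 *out)
{
	for (int row = 0; row < nrows; row++)
		out[row] = get_hash(column_state, row);
}

/* Specialized approach: the accessor for one concrete layout is visible to the
 * compiler, so the loop body can be inlined and vectorized. */
static inline uint64
hash_fixed64_at(const uint64 *values, int row)
{
	return values[row] * UINT64CONST(0x9E3779B97F4A7C15); /* toy hash, for illustration */
}

static void
hash_rows_fixed64(const uint64 *values, int nrows, uint64 *out)
{
	for (int row = 0; row < nrows; row++)
		out[row] = hash_fixed64_at(values, row);
}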

Comment on lines +113 to +116
const int total_bytes = output_key.len + VARHDRSZ;
text *restrict stored = (text *) MemoryContextAlloc(policy->hashing.key_body_mctx, total_bytes);
SET_VARSIZE(stored, total_bytes);
memcpy(VARDATA(stored), output_key.data, output_key.len);
Contributor:

Suggest making use of a PostgreSQL built-in function:

Suggested change:

-const int total_bytes = output_key.len + VARHDRSZ;
-text *restrict stored = (text *) MemoryContextAlloc(policy->hashing.key_body_mctx, total_bytes);
-SET_VARSIZE(stored, total_bytes);
-memcpy(VARDATA(stored), output_key.data, output_key.len);
+MemoryContext oldmcxt = MemoryContextSwitchTo(policy->hashing.key_body_mctx);
+text *stored = cstring_to_text_with_len((const char *) output_key.data, output_key.len);
+MemoryContextSwitchTo(oldmcxt);

Member (author):

I think it's better to have something that can be inlined here, because it's a part of a hot loop that builds the hash table.

@@ -59,7 +59,7 @@ jobs:
 build_type: ${{ fromJson(needs.config.outputs.build_type) }}
 ignores: ["chunk_adaptive metadata telemetry"]
 tsl_ignores: ["compression_algos"]
-tsl_skips: ["bgw_db_scheduler bgw_db_scheduler_fixed"]
+tsl_skips: ["vector_agg_text vector_agg_groupagg bgw_db_scheduler bgw_db_scheduler_fixed"]
Contributor:

Why do we need to skip these tests on Windows? Is it also because of UMASH?

Member (author):

Right, I didn't get it to compile there, so I decided to disable them for now.
